1 A topological sorting and loop cleansing algorithm for a 
constrained MIMD compiler of shift-invariant flow graphs 

Lee, S.; Barnwell, T., III; 

Acoustics, Speech, and Signal Processing, IEEE International 
Conference on ICASSP '86. , Volume: 11 , Apr 1986 
Page(s): 2927 -2930 



[Abstract] [PDF Full-Text (176 KB)] IEEE CNF 



2 Structure-based automatic extraction of the program 
heterogeneity 

Guosun Zeng; Xinda Lu; Jingcun Wang; Dingkang Zhou; 
High Performance Computing in the Asia-Pacific Region, 2000. 
Proceedings. The Fourth International Conference/Exhibition on , 
Volume: 1 , 14-17 May 2000 
Page(s): 261 -262 vol.1 



[Abstract] [PDF Full-Text (140 KB)] IEEE CNF 



3 Fast software implementation of MPEG advanced audio 
encoder 

Dimković, I.; Milovanović, D.; Bojković, Z.; 
Digital Signal Processing, 2002. DSP 2002. 2002 14th International 
Conference on , Volume: 2 , 1-3 July 2002 
Page(s): 839 -843 vol.2 



[Abstract] [PDF Full-Text (421 KB)] IEEE CNF 



4 Experimental performance evaluation of the clustered 
multiprocessor system MUGEN 

Horiguchi, S.; Kawazoe, Y.; 

TENCON '89. Fourth IEEE Region 10 International Conference , 22-24 
Nov. 1989 
Page(s): 205 -208 



[Abstract] [PDF Full-Text (264 KB)] IEEE CNF 



5 Data parallel computers and the FORALL statement 

Albert, E.; Lukas, J.D.; Steele, G.L., Jr.; 

Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd 
Symposium on the , 8-10 Oct. 1990 
Page(s): 390 -396 



[Abstract] [PDF Full-Text (480 KB)] IEEE CNF 



6 Compiling SIMD programs for MIMD architectures 

Quinn, M.J.; Hatcher, P.J.; 

Computer Languages, 1990., International Conference on , 12-15 
March 1990 
Page(s): 291 -296 

[Abstract] [PDF Full-Text (440 KB)] IEEE CNF 



7 Fast barrier synchronization hardware 

Beckmann, C.J.; Polychronopoulos, C.D.; 
Supercomputing '90. Proceedings of, 12-16 Nov. 1990 
Page(s): 180 -189 



[Abstract] [PDF Full-Text (676 KB)] IEEE CNF 



8 A decoupled access/ execute processor for matrix algorithms: 
architecture and programming 

Moreno, J.H.; Figueroa, M.E.; 

Application Specific Array Processors, 1991. Proceedings of the 
International Conference on , 2-4 Sept. 1991 
Page(s): 281 -295 



[Abstract] [PDF Full-Text (568 KB)] IEEE CNF 



9 FORGE 90: a parallel programming environment 

Levesque, J.M.; 

Compcon Spring '92. Thirty-Seventh IEEE Computer Society 
International Conference, Digest of Papers. , 24-28 Feb. 1992 
Page(s): 291 -294 



[Abstract] [PDF Full-Text (288 KB)] IEEE CNF 



10 A global mode instruction minimization technique for 
embedded DSPs 

Wilson, T.C.; Grewal, G.W.; 

VLSI, 1996. Proceedings., Sixth Great Lakes Symposium on , 22-23 
March 1996 
Page(s): 18-21 



[Abstract] [PDF Full-Text (304 KB)] IEEE CNF 



11 Extracting SIMD parallelism from 'for' loops 

Gustin, V.; Bulic, P.; 

Parallel Processing Workshops, 2001. International Conference on , 
3-7 Sept. 2001 
Page(s): 23 -28 



[Abstract] [PDF Full-Text (360 KB)] IEEE CNF 



12 Exploiting operation level parallelism through dynamically 
reconfigurable datapaths 

Zhining Huang; Sharad, M.; 

Design Automation Conference, 2002. Proceedings. 39th , 10-14 June 
2002 

Page(s): 337 -342 
13 Extending value reuse to basic blocks with compiler support 

Huang, J.; Lilja, D.J.; 

Computers, IEEE Transactions on , Volume: 49 Issue: 4 , April 2000 
Page(s): 331 -347 
CiteSeer 

Searching for PHRASE pipeline loop. 


18 documents found. Order: citations weighted by year. 

Sentinel Scheduling for VLIW and Superscalar Processors - Mahlke (1992) (Correct) (36 citations) 
can be used in conjunction with software pipeline loop scheduling [4] or straight-line code 
ftp.crhc.uiuc.edu/pub/IMPACT/conference/asplos-92-sentinel.ps 

Using Iterative Compilation for Managing Software.. - van der Mark, Rohou, al. (1999) (Correct) (5 citations) 
Our Work. 2. Loop Unrolling And Software Pipeline Loop Unrolling And Software Pipeline Are Two Major 
www.liacs.nl/~pmark/publications/scopes99.ps 

Software Pipelining with Register Allocation and Spilling - Wang, Krall, Ertl, Eisenbeis (1994) (Correct) (11 citations) 
4. t2=s*s 5. t3=t1*t2 6. a[t0]t3 (1) The Loop Pipeline Number Operation Latency Memory port 2 Load 
ftp.inria.fr/INRIA/Projects/a3/eisenbei/micro27.ps.Z 

Sentinel Scheduling with Recovery Blocks - David August Brian (1995) (Correct) (4 citations) 
execution -used in conjunction with software pipeline loop scheduling [7] or straight-line code 
www.crhc.uiuc.edu/IMPACT/ftp/report/crhc-95-05.sentinel.ps.Z 

Memory Access Optimization and RAM Inference for Pipeline.. - Weinhardt, Luk (1999) (Correct) (1 citation) 
during the iterations of the outer while loop. Pipeline vectorization consists of three main steps: 
www.doc.ic.ac.uk/~mw8/papers/fpl99.pdf.gz 

Mapping Loops on Coarse-Grain Reconfigurable Architectures.. - Lee, Choi, Dutt (Correct) 
executions using a novel organization of the loop pipeline. We develop the conditions for sharing memory 
number of lines used for one instance of the loop pipeline, b) the number of configurations, c) the 
www.cecs.uci.edu/technical_report/TR02-34.pdf 

Acceleration of First and Higher Order Recurrences on - Processors With Instruction (Correct) 
Keywords recurrences, parallelism, software pipeline, loop optimization, height reduction, 
www.hpl.hp.com/research/itc/car/papers/../papers/Acceleration.pdf 

Height Reduction of Control Recurrences for ILP Processors - Michael Schlansker Vinod (Correct) 
blocked back-substitution, software pipeline, loop optimization 1 Introduction Control and 
www.hpl.hp.com/research/itc/car/papers/../papers/Control-height.pdf 

Parallelization of Control Recurrences for ILP Processors - Michael Schlansker Vinod (Correct) 
blocked back-substitution, software pipeline, loop optimization 1 Introduction Control and 
www.hpl.hp.com/research/itc/car/papers/../papers/parallelization.pdf 

Building Parallel Applications using Design Patterns - Goswami, Singh, Preiss (2000) (Correct) 
supports patterns supporting replication, pipeline, loop and conditional constructs. Tracs is another 
www.pads.uwaterloo.ca/Bruno.Preiss/papers/published/2000/cser/paper.ps 

Memory Access Optimization for Reconfigurable Systems - Weinhardt. Luk (Correct) 

the innermost for loop (line 8-19) is such a pipeline loop. It contains the functions Fmin and Fmax (not 

xN2 x8 for (int y=0 yN2 ypipeline loop *9 if (xN &yN) 10 im_out[x,y] y2 

www.markus-weinhardt.de/papers/memopt.ps.gz 

Custom Embedded Counterflow Pipelines - Childers (2000) (Correct) 

instruction set architecture for the software pipeline loop. As part of this work, techniques were 
Figure 50: Software pipeline loop. 

to match the dynamic behavior of a kernel loop. Pipeline refinement is very simple: it identifies 
ftp.cs.virginia.edu/pub/dissertations/2000-06.ps.Z 
1 Research sessions: implementation techniques: Implementing database operations using 
SIMD instructions 85% 

Jingren Zhou , Kenneth A. Ross 

Proceedings of the 2002 ACM SIGMOD international conference on Management of 
data June 2002 

Modern CPUs have instructions that allow basic operations to be performed on several data 
elements in parallel. These instructions are called SIMD instructions, since they apply a single 
instruction to multiple data elements. SIMD technology was initially built into commodity 
processors in order to accelerate the performance of multimedia applications. SIMD 
instructions provide new opportunities for database engine design and implementation. We 
study various kinds of operations in a database con ... 



2 MOM: a matrix SIMD instruction set architecture for multimedia applications 82% 
Jesus Corbal , Roger Espasa , Mateo Valero 

Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) January 
1999 

3 Integrating SIMD into the undergraduate curriculum 82% 
W. D. Maurer 

The Journal of Computing in Small Colleges , Proceedings of the sixth annual CCSC 
northeastern conference on The journal of computing in small colleges April 2001 
Volume 16 Issue 4 

Assembly language instruction today, in our view, should include instruction in the newly 
important area of single-instruction, multiple-data (SIMD) instructions. Such instructions are 
available on all major platforms, and they considerably speed up operations on arrays, 
particularly large arrays. This speedup is more pronounced with assembly language than with 
algebraic language programming, and thus provides another reason for undergraduate 
students to learn assembly language. We discuss th ... 

4 Simulation and architecture evaluation: Vector vs. superscalar and VLIW architectures for 
embedded multimedia benchmarks 80% 

Christoforos Kozyrakis , David Patterson 

Proceedings of the 35th annual ACM/IEEE international symposium on 
Microarchitecture November 2002 

Multimedia processing on embedded devices requires an architecture that leads to high 
performance, low power consumption, reduced design complexity, and small code size. In 
this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector 
architecture to superscalar and VLIW processors for embedded multimedia applications. The 
comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip 
that integrates a vector processor with DRAM main memory. We de ... 

5 PACT 2001 workshops: MediaBreeze: a decoupled architecture for accelerating multimedia 
applications 80% 

Deependra Talla , Lizy K. John 

ACM SIGARCH Computer Architecture News December 2001 
Volume 29 Issue 5 

Decoupled architectures are fine-grain processors that partition the memory access and 
execute functions in a computer program and exploit the parallelism between the two 
functions. Although some concepts from the traditional decoupled access execute paradigm 
made its way into commercial processors, they encountered resistance in general-purpose 
applications because these applications are not very structured and regular. However, 
multimedia applications have recently become dominant workload on ... 

6 Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology 80% 
Huy Nguyen , Lizy Kurian John 

Proceedings of the 13th international conference on Supercomputing May 1999 

7 Exploiting instruction level parallelism in geometry processing for three dimensional graphics 
applications 80% 

Chia-Lin Yang , Barton Sano , Alvin R. Lebeck 

Proceedings of the 31st annual ACM/IEEE international symposium on 
Microarchitecture November 1998 

8 Growing discord: programming philosophy and hardware design 77% 
K. W. Neves 

Proceedings of the 1988 ACM/IEEE conference on Supercomputing November 1988 
Generally, vector compiler technology has been successful in achieving reasonable peak 
efficiency on “good” code. Moreover, the community's ability to generate 
“good” vector code has improved dramatically. As we move into the era of 
parallelism, particularly in supercomputing, we can observe certain trends among the leaders 
in compiler technology. The basic techniques are extensions of strategies for vector 
machines, but have limited effectiveness in a parallel envir ... 
9 A survey of processors with explicit multithreading 77% 
Theo Ungerer , Borut Robič , Jurij Šilc 

ACM Computing Surveys (CSUR) March 2003 
Volume 35 Issue 1 

Hardware multithreading is becoming a generally applied technique in the next generation of 
microprocessors. Several multithreaded processors are announced by industry or already into 
production in the areas of high-performance microprocessors, media, and network 
processors. A multithreaded processor is able to pursue two or more threads of control in 
parallel within the processor pipeline. The contexts of two or more threads of control are 
often stored in separate on-chip register sets. Unused i ... 

10 Articles: Blurring Lines Between Hardware and Software 77% 
Homayoun Shahri 

Queue April 2003 
Volume 1 Issue 2 



11 Static resource models for code-size efficient embedded processors 77% 
Qin Zhao , Bart Mesman , Twan Basten 

ACM Transactions on Embedded Computing Systems (TECS) May 2003 
Volume 2 Issue 2 

Due to an increasing need for flexibility, embedded systems embody more and more 
programmable processors as their core components. Due to silicon area and power 
considerations, the corresponding instruction sets are often highly encoded to minimize code 
size for given performance requirements. This has hampered the development of robust 
optimizing compilers because the resulting irregular instruction set architectures are far from 
convenient compiler targets. Among other considerations, they int ... 

12 Ray tracing on programmable graphics hardware 77% 
Timothy J. Purcell , Ian Buck , William R. Mark , Pat Hanrahan 

ACM Transactions on Graphics (TOG) , Proceedings of the 29th annual conference on 
Computer graphics and interactive techniques July 2002 
Volume 21 Issue 3 

Recently a breakthrough has occurred in graphics hardware: fixed function pipelines have 
been replaced with programmable vertex and fragment processors. In the near future, the 
graphics pipeline is likely to evolve into a general programmable stream processor capable of 
more than simply feed-forward triangle rendering. In this paper, we evaluate these trends in 
programmability of the graphics pipeline and explain how ray tracing can be mapped to 
graphics hardware. Using our simulator, we analyze ... 

13 Energy aware compilation for DSPs with SIMD instructions 77% 
Markus Lorenz , Lars Wehmeyer , Thorsten Drager 

ACM SIGPLAN Notices , Proceedings of the joint conference on Languages, compilers 
and tools for embedded systems: software and compilers for embedded systems June 
2002 

Volume 37 Issue 7 

The growing use of digital signal processors (DSPs) in embedded systems necessitates the 
use of optimizing compilers supporting special hardware features. In this paper we present 
compiler optimizations with the aim of minimizing energy consumption of embedded 
applications: This comprises loop optimizations for exploitation of SIMD instructions and 
zero overhead hardware loops in order to increase performance and decrease the energy 
consumption. In addition, we use a phase coupled code generator ... 



14 Embedded tutorial: Code generation for embedded processors 77% 
Rainer Leupers 

Proceedings of the 13th international symposium on System synthesis September 2000 
The increasing use of programmable processors as IP blocks in embedded system design 
creates a need for C/C++ compilers capable of generating efficient machine code. Many of 
today's compilers for embedded processors suffer from insufficient code quality in terms of 
code size and performance. This violates the tight chip area and real-time constraints often 
imposed on embedded systems. The reason is that embedded processors typically show 
architectural features which are not well handled by class ... 



15 Polygon rendering on a stream architecture 77% 
John D. Owens , William J. Dally , Ujval J. Kapasi , Scott Rixner , Peter Mattson , Ben 
Mowery 

Proceedings 2000 SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware 

August 2000 

The use of a programmable stream architecture in polygon rendering provides a powerful 
mechanism to address the high performance needs of today's complex scenes as well as the 
need for flexibility and programmability in the polygon rendering pipeline. We describe how a 
polygon rendering pipeline maps into data streams and kernels that operate on streams, and 
how this mapping is used to implement the polygon rendering pipeline on Imagine, a 
programmable stream processor. We compare our resul ... 

16 Exploiting a new level of DLP in multimedia applications 77% 
Jesus Corbal , Mateo Valero , Roger Espasa 

Proceedings of the 32nd annual ACM/IEEE international symposium on 
Microarchitecture November 1999 

This paper proposes and evaluates MOM: a novel ISA paradigm targeted at multimedia 
applications. By fusing conventional vector ISA approaches together with more recent 
SIMD-like (Single Instruction Multiple Data) ISAs (such as MMX), we have developed a 
new matrix oriented ISA which efficiently deals with the small matrix structures typically 
found in multimedia applications. MOM exploits a level of DLP not reachable by neither 
conventional vector ISAs nor SIMD-like media ISA extensi ... 



17 Optimizing the data cache performance of a software MPEG-2 video decoder 77% 
Peter Soderquist , Miriam Leeser 

Proceedings of the fifth ACM international conference on Multimedia November 1997 



18 PixelFlow: the realization 77% 
John Eyles , Steven Molnar , John Poulton , Trey Greer , Anselmo Lastra , Nick England , 
Lee Westover 

Proceedings of the 1997 SIGGRAPH/Eurographics workshop on Graphics hardware 

August 1997 
CiteSeer 
Searching for single instruction and loop and pipeline. 


45 documents found. Order: citations weighted by year. 

The Superthreaded Architecture: Thread Pipelining with Run-time.. - Tsai, Yew (1996) (Correct) (57 citations) 
blocks need to be grouped together in a single instruction stream. As a larger instruction window size 
the superthreaded architectural model can exploit loop-level parallelism from a broad range of 
serious when a compiler attempts to software pipeline a loop with conditional branches [18] In 
www-users.cs.umn.edu/Research/Agassiz/Paper/tsai.pact96.ps.Z 

One or more of the query terms is very common - only partial results have been returned. Try Google (RI). 

IMPACT: An Architectural Framework for.. - Chang, Mahlke.. (1991) (Correct) (91 citations) 

achieve solid speedup over high-performance single-instruction-issue processors. We ran experiments to 

function inline expansion, instruction placement, loop unrolling, loop peeling, memory disambiguation, 

[Kogge 81]By optimizing a simple instruction pipeline structure, current pipelined processors can 

ftp.crhc.uiuc.edu/pub/IMPACT/conference/isca-91-framework.ps 

Efficient Microarchitecture Modeling and Path Analysis for.. - Li, Malik. Wolfe (1996) (Correct) (32 citations) 
timing analysis. The execution time of a single instruction depends on many factors and varies more 
absence of dynamic structures and (iii) bounded loops. These restrictions can be imposed either through 
for modern processors due to the presence of pipelined instruction execution units and cached memory 
ftp.ee.princeton.edu/pub/yauli/rtss95.ps.gz 

Compiling Fortran D for MIMD Distributed-Memory Machines - Hiranandani (1992) (Correct) (47 citations) 
Parallel Computer Forum (PCF) Fortran [24]Single-instruction, multiple-data (SIMD) machines such as the 
with explicit synchronization and parallel loops found in Parallel Computer Forum (PCF) Fortran 
www.cs.umd.edu/~keleher/papers/fortrand.ps.gz 

Zero-Cycle Loads: Microarchitecture Support for Reducing Load.. - Austin (1995) (Correct) (25 citations) 
levels of the data memory hierarchy in a single instruction. A significant body of work is dedicated to 
eliminates the entire load operation, or (loop) blocking eliminates many cache miss latencies. In 
up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data 
ftp.cs.wisc.edu/sohi/papers/1995/micro.zcl.ps.gz 

The M-Machine Multicomputer - Fillo, Keckler, Dally, al. (1995) (Correct) (17 citations) 

one for each ALU. All operations in a single instruction issue together but may complete out of 

parallelized by identifying tasks, such as loop iterations, that can be distributed both across 

year [10]As a result, a 64-bit processor with a pipelined FPU (400M 2 is only 11% of a 3.6G 2 1993 

publications.ai.mit.edu/ai-publications/pdf/AIM-1532.pdf 

The M-Machine Multicomputer - Marco Fillo (1995) (Correct) (17 citations) 

one for each ALU. All operations in a single instruction issue together but may complete out of 

parallelized by identifying tasks, such as loop iterations, that can be distributed both across 

year [10]As a result, a 64-bit processor with a pipelined FPU (400M 2 is only 11% of a 3.6G 2 1993 

cva.stanford.edu/pub/publications/AI1532.ps.Z 

The Superthreaded Processor Architecture - Jenn-Yuan Tsai (1999) (Correct) (5 citations) 

from different basic blocks in a single instruction stream need to be examined and issued 

serious when a compiler attempts to pipeline a loop with many conditional branches [21] In 

is especially serious when a compiler attempts to pipeline a loop with many conditional branches [21] In 

www.cs.umn.edu/Research/Agassiz/Paper/tsai.ieee.ps.gz 

Cost-effective Hardware Acceleration of Multimedia Applications - Deependra Talla And (2001) (Correct) (1 citation) 
processors (GPPs) have been enhanced with single instruction multiple data (SIMD) execution units [1] 
packing/unpacking, permute, loads/stores, and loop branches) dominate media instruction streams 
in determining the maximum clock rate? How many pipeline stages does the hardware add to the processor 
www.ece.utexas.edu/projects/ece/lca/ps/deepu-iccd01.pdf 

Caching and Predicting Branch Sequences for improved Fetch.. - Onder, Xu, Gupta (1999) (Correct) (2 citations) 

two as their execution is separated by a single instruction that performs a compare (c =EOLN) 

in Figure 1 , assume that during the execution of a loop iteration both the conditionals evaluate to false, 

for speculative execution in the processor pipeline. Thus, if the predictions are correct the 

www.cs.arizona.edu/people/gupta/research/Publications/Comp/pact99.ps 

Performance Nonmonotonicities: A Case Study of the UltraSPARC. - Kushman (1998) (Correct) (3 citations) 
exist in which the addition or removal of a single instruction changes the performance of a program by a 
in 6 cycles per iteration. 29 3-13 Assembly code loop which is executed at the maximum execution rate of 
3-1. 20 3-3 The nine-stage pipeline of the UltraSPARC. 
www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-782.pdf 

Comparing Software and Hardware Schemes For Reducing the.. - Hwu, Conte, Chang (1989) (Correct) (16 citations) 
compare instructions [8][9] A case for single-instruction conditional branches is given in [6] When 
that backward branches are usually at the end of loops. In the study done by J. E. Smith [4]the 
disrupt the flow of instructions through the the pipeline, increasing the overall execution cost of branch 
ftp.crhc.uiuc.edu/pub/IMPACT/conference/isca-89-branch.ps 

Improving Software Pipelining With Unroll-and-Jam - Carr, Ding, Sweany (1996) (Correct) (5 citations) 
that in our simple machine an add requires a single instruction to produce a result while the multiplier 
been developed [1, 2, 3, 4] Unfortunately, not all loops have enough parallelism in the innermost loop 
www.cs.rice.edu/~cding/documents/unroll.ps.gz 

Vector Instruction Set Support for Conditional Operations - Smith Greg Faanes (2000) (Correct) (1 citation) 

(typically several 10s to 100s) in a single instruction. These instructions are executed in a 

Support for conditional operations (as occur in loops containing IF statements) is an important aspect 

with SIMD implementations, but long vector, pipelined implementations have a number of advantages and 

www.ece.wisc.edu/~jes/papers/isca00.smith.ps 

Using Measurements to Derive the Worst-Case Execution Time - Lindgren, Hansson, Thane (2000) (Correct) (1 citation) 

machine timing effects that depend on a single instruction and its immediate neighbors. Examples of 
flow analysis (such as number of iterations in loops) with low-level timing information. This paper 
show how it can be extended to architectures with pipelines. 1 . Introduction Worst-case execution time 
www.mrtc.mdh.se/publications/0257.ps 

Sentinel Scheduling with Recovery Blocks - David August Brian (1995) (Correct) (4 citations) 

files as the maximum number of branches any single instruction is speculated above. Given that some 

- used in conjunction with software pipeline loop scheduling [7] or straight-line code scheduling 

execution -used in conjunction with software pipeline loop scheduling [7] or straight-line code 

www.crhc.uiuc.edu/IMPACT/ftp/report/crhc-95-05.sentinel.ps.Z 

An Implementation Of GURPR*: A Software Pipelining Algorithm - Bockhaus (1992) (Correct) (2 citations) 
: 2 2.1: A Single Instruction (Node) in the EPS Machine Model, 
graph :17 3.2.4 Pipelining the loop body :19 
www.crhc.uiuc.edu/IMPACT/ftp/report/ms-thesis-john-bockhaus.ps.Z 

Program Balance and its Impact on High Performance RISC. - Lizy Kurian (1995) (Correct) (1 citation) 
to the importance of memory bandwidth. Single instruction stream parallelism will not be much 
defined two metrics called 'machine balance' and 'loop balance' which together indicate how efficiently a 
Access/Execute Balance, Memory Bandwidth, Pipeline Balance, Program Behavior. 1 Introduction 
www.ece.utexas.edu/~ljohn/raleigh.ps 


Parallel Processing for Volume Visualization - Silva (1992) (Correct) (1 citation) 

this paper. Of interest to our work are the single-instruction stream, multiple-data stream (SIMD) and the 

to a SIMD algorithm by replacing each inner loop with a single broadcast instruction that 